(54) System and method for providing remote automatic speech recognition services via a
packet network



(57) A system and method of operating an automatic speech
recognition service using a client-server architecture is used to
make ASR services accessible at a client location remote from the
location of the main ASR engine. The present invention utilizes
client-server communications over a packet network, such as the
Internet, where the ASR server receives a grammar from the client,
receives information representing speech from the client, performs
speech recognition, and returns information based upon the
recognized speech to the client.

Description



Technical Field



This invention relates to speech recognition in gen- 
eral and, more particularly, provides a way of providing 
remotely-accessible automatic speech recognition serv- 
ices via a packet network. 

Background of the Invention 

Techniques for accomplishing automatic speech recognition (ASR) are
well known. Among known ASR techniques are those that use grammars.
A grammar is a representation of the language or phrases expected to
be used or spoken in a given context. In one sense, then, ASR
grammars typically constrain the speech recognizer to a vocabulary
that is a subset of the universe of potentially-spoken words; and
grammars may include subgrammars. An ASR grammar rule can then be
used to represent the set of "phrases" or combinations of words from
one or more grammars or subgrammars that may be expected in a given
context. "Grammar" may also refer generally to a statistical
language model (where a model represents phrases), such as those
used in language understanding systems.

Products and services that utilize some form of automatic speech
recognition ("ASR") methodology have been recently introduced
commercially. For example, AT&T has developed a grammar-based ASR
engine called WATSON that enables development of complex ASR
services. Desirable attributes of complex ASR services that would
utilize such ASR technology include high accuracy in recognition;
robustness to enable recognition where speakers have differing
accents or dialects, and/or in the presence of background noise;
ability to handle large vocabularies; and natural language
understanding. In order to achieve these attributes for complex ASR
services, ASR techniques and engines typically require
computer-based systems having significant processing capability in
order to achieve the desired speech recognition capability.
Processing capability as used herein refers to processor speed,
memory, disk space, as well as access to application databases. Such
requirements have restricted the development of complex ASR services
that are available at one's desktop, because the processing
requirements exceed the capabilities of most desktop systems, which
are typically based on personal computer (PC) technology.

Packet networks are general-purpose data networks which are
well-suited for sending stored data of various types, including
speech or audio. The Internet, the largest and most renowned of the
existing packet networks, connects over 4 million computers in some
140 countries. The Internet's global and exponential growth is
common knowledge today.

Typically, one accesses a packet network, such as the Internet,
through a client software program executing on a computer, such as a
PC, and so packet networks are inherently client/server oriented.
One way of accessing information over a packet network is through
use of a Web browser (such as the Netscape Navigator, available from
Netscape Communications, Inc., and the Internet Explorer, available
from Microsoft Corp.) which enables a client to interact with Web
servers. Web servers and the information available therein are
typically identified and addressed through a Uniform Resource
Locator (URL)-compatible address. URL addressing is widely used in
Internet and intranet applications and is well known to those
skilled in the art (an "intranet" is a packet network modeled in
functionality based upon the Internet and is used, e.g., by
companies locally or internally).

What is desired is a way of enabling ASR services that may be made
available to users at a location, such as at their desktop, that is
remote from the system hosting the ASR engine.

Summary of the Invention

A system and method of operating an automatic speech recognition
service using a client-server architecture is used to make ASR
services accessible at a client location remote from the location of
the main ASR engine. In accordance with the present invention, using
client-server communications over a packet network, such as the
Internet, the ASR server receives a grammar from the client,
receives information representing speech from the client, performs
speech recognition, and returns information based upon the
recognized speech to the client. Alternative embodiments of the
present invention include a variety of ways to obtain access to the
desired grammar, use of compression or feature extraction as a
processing step at the ASR client prior to transferring speech
information to the ASR server, staging a dialogue between client and
server, and operating a form-filling service.



Brief Description of the Drawings



FIG. 1 is a diagram showing a client-server relationship for a
system providing remote ASR services in accordance with the present
invention.

FIG. 2 is a diagram showing a setup process for enabling remote ASR
services in accordance with the present invention.

FIG. 3 is a diagram showing an alternative setup process for
enabling remote ASR services in accordance with the present
invention.

FIG. 4 is a diagram showing a process for rule selection in
accordance with the present invention.

FIG. 5 is a diagram showing a process for enabling remote automatic
speech recognition in accordance with the present invention.

FIG. 6 is a diagram showing an alternative process for enabling
remote automatic speech recognition in accordance with the present
invention.

FIG. 7 is a diagram showing another alternative process for enabling
remote automatic speech recognition in accordance with the present
invention.



Detailed Description 

The present invention is directed to a client-server based system
for providing remotely-available ASR services. In accordance with
the present invention, ASR services may be provided to a user (e.g.,
at the user's desktop) over a packet network, such as the Internet,
without the need for the user to obtain computer hardware having the
extensive processing capability required for executing full ASR
techniques.

A basic client-server architecture used in accordance with the
present invention is shown in FIG. 1. ASR server 100 is an ASR
software engine which executes on a system, denoted as server node
110, that can be linked across packet network 120 (such as the
Internet) to other computers. Server node 110 may typically be a
computer having processing capability sufficient for running complex
ASR-based applications, such as the AT&T WATSON system. Packet
network 120 may, illustratively, be the Internet or an intranet.

ASR client 130 is a relatively small program (when compared to ASR
server 100) that executes on client PC 140. Client PC 140 is a
computer, such as a personal computer (PC), having sufficient
processing capability for running client applications, such as a Web
browser. Client PC 140 includes hardware, such as a microphone, and
software for the input and capture of audio sounds, such as speech.
Methods for connecting microphones to a PC and capturing audio
sounds, such as speech, at the PC are well known. Examples of speech
handling capabilities for PCs include the Speech Application
Programmer Interface (SAPI) from Microsoft and the AT&T Advanced
Speech Application Programmer Interface (ASAPI). Details of the
Microsoft SAPI are found in, e.g., a publication entitled "Speech
API Developers Guide, Windows 95 Edition," Vers. 1.0, Microsoft
Corporation (1995), and details of the AT&T ASAPI are provided in a
publication entitled "Advanced Speech API Developers Guide," Vers.
1.0, AT&T Corporation (1996); each of these publications is
incorporated herein by reference. An alternative embodiment of the
present invention may utilize an interface between ASR client 130
and one or more voice channels, such that speech input may be
provided by audio sources other than a microphone.

Client PC 140 also has the capability of communicating with other
computers over a packet network (such as the Internet). Methods for
establishing a communications link to other computers over a packet
network (such as the Internet) are well known and include, e.g., use
of a modem to dial into an Internet service provider over a
telephone line.



ASR server 100, through server node 110, and ASR client 130, through
client PC 140, may communicate with one another over packet network
120 using known methods suitable for communicating information
(including the transmission of data) over a packet network using,
e.g., a standard communications protocol such as the Transmission
Control Protocol/Internet Protocol (TCP/IP) socket. A TCP/IP socket
is analogous to a "pipe" through which information may be
transmitted over a packet network from one point to another.
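
As a rough illustration of the "pipe" analogy, the following Python
sketch opens a TCP connection on the loopback interface and passes a
few bytes through it. The helper names loopback_pipe and recv_exact,
and the use of port 0, are assumptions made for this example only;
nothing here is taken from the ASR client or server themselves.

```python
import socket

def loopback_pipe():
    """Return a connected (client, server) pair of TCP sockets on 127.0.0.1."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _addr = listener.accept()
    listener.close()
    return client, server

def recv_exact(sock, n):
    """Read exactly n bytes; TCP is a byte stream, so one recv may return fewer."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed early")
        buf += chunk
    return buf

client_end, server_end = loopback_pipe()
client_end.sendall(b"hello")          # bytes go in at one end of the pipe
received = recv_exact(server_end, 5)  # and come out at the other
client_end.close()
server_end.close()
```

Because TCP delivers a byte stream rather than discrete messages, any
real exchange of grammars, audio and text over such a socket would
also need some framing convention, which is left out of this sketch.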

Establishment of a TCP/IP socket between ASR server 100 and ASR
client 130 will enable the transfer of data between ASR server 100
and ASR client 130 over packet network 120 necessary to enable
remote ASR services in accordance with the present invention. ASR
client 130 also interfaces with audio/speech input and output
capabilities and text/graphics display capabilities of client PC
140. Methods and interfaces for handling input and output of audio
and speech are well known, and text and graphics display handling
methods and interfaces are also well known.

ASR client 130 may be set up to run in client PC 140 in several
ways. For example, ASR client 130 may be loaded onto client PC 140
from a permanent data storage medium, such as a magnetic disk or
CD-ROM. In the alternative, ASR client 130 may be downloaded from an
information or data source locatable over a packet network, such as
the Internet. Downloading of ASR client 130 may, e.g., be
accomplished once to reside permanently in client PC 140;
alternatively, ASR client 130 may be downloaded for single or
limited use purposes. ASR client 130 may be implemented, e.g., as a
small plug-in software module for another program, such as a Web
browser, that executes on client PC 140. One way of accomplishing
this is to make ASR client 130 an Active-X software component
according to the Microsoft Active-X standard. In this way, ASR
client 130 may, e.g., be loaded into client PC 140 in conjunction
with a Web browsing session as follows: a user browsing the World
Wide Web using client PC 140 enters a Web site having ASR
capability; the Web site asks the user permission to download an ASR
client module into client PC 140 in accordance with signed Active-X
control; upon the user's authorization, ASR client 130 is downloaded
into client PC 140. Similarly, ASR server 100 may be set up to run
in server node 110 in several ways. For example, ASR server 100 may
be loaded onto server node 110 from a permanent data storage medium,
such as a magnetic disk or CD-ROM, or, in the alternative, ASR
server 100 may be downloaded from an information or data source
locatable over a packet network, such as the Internet.

Further details of providing remote ASR services in accordance with
the present invention will now be described with reference to FIGS.
2-7. It is presumed for the discussion to follow with respect to
each of these figures that the client-server relationship is as
shown in FIG. 1. A setup phase is used to prepare ASR server 100 and
ASR client 130 for performing an automatic speech recognition task
as part of an ASR application. For convenience, items shown in FIG.
1 and appearing in other figures will be identified by the same
reference numbers as in FIG. 1.

Referring now to FIG. 2, a setup phase in a process of providing
remote ASR services will now be described. At step 201, ASR client
130 receives a request from the application to load a client
grammar. The client grammar is illustratively a data file containing
information representing the language (e.g., words and phrases)
expected to be spoken in the context of the particular ASR
application. The data file may be in a known format, such as the
Standard Grammar Format (SGF) which is part of the Microsoft SAPI.

For purposes of illustration, an ASR application for taking a pizza
order will be used in describing the present invention. An ASR
service application, such as an application for pizza-ordering,
would typically include a program that interfaces with and uses ASR
client 130 as a resource used for accomplishing the tasks of the ASR
application. Such an ASR application could reside and execute, in
whole or in part, in client PC 140.

Considering the pizza ordering example, the client grammar PIZZA
would include information representing words that one may use in
ordering pizza, such as "pizza," "pepperoni," etc. In fact,
subgrammars may be used to build an appropriate grammar. For the
pizza ordering example, subgrammars for the PIZZA grammar could
include SIZE and TOPPING. The subgrammar SIZE could consist of words
used to describe the size of the pizza desired, such as "small,"
"medium" and "large." The subgrammar TOPPING may consist of words
used to describe the various toppings one may order with a pizza,
e.g., "sausage," "pepperoni," "mushroom" and the like.
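
One way to picture the PIZZA grammar and its subgrammars is as a
small nested data structure. The Python sketch below is only an
assumed in-memory layout for illustration; it is not the SGF file
format, and the names PIZZA_GRAMMAR and vocabulary are hypothetical.

```python
# Hypothetical in-memory form of the PIZZA grammar. The word lists come
# from the example in the text; the dictionary layout is an assumption.
PIZZA_GRAMMAR = {
    "name": "PIZZA",
    "words": ["pizza"],
    "subgrammars": {
        "SIZE": ["small", "medium", "large"],
        "TOPPING": ["sausage", "pepperoni", "mushroom"],
    },
}

def vocabulary(grammar):
    """Collect every word the grammar would allow the recognizer to output."""
    words = set(grammar["words"])
    for sub_words in grammar["subgrammars"].values():
        words.update(sub_words)
    return words
```

Calling vocabulary(PIZZA_GRAMMAR) yields the seven words listed above,
which shows how a grammar restricts the recognizer to a small subset
of all possible spoken words.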

ASR client 130 may be given the desired grammar from the application
or, alternatively, ASR client 130 may choose the grammar from a
predetermined set based upon information provided by the
application. Either way, ASR client 130 then at step 202 sends the
desired grammar file to ASR server 100 over a TCP/IP socket. A new
TCP/IP socket may have to be set up as part of establishing a new
communications session between client PC 140 and server node 110, or
the TCP/IP socket may already exist as the result of an established
communications session between client PC 140 and server node 110
that has not been terminated. In the pizza ordering illustration,
ASR client 130 would cause transmission of a file containing the
PIZZA grammar to ASR server 100 over a TCP/IP socket.

At step 203, ASR server 100 receives the client grammar sent from
ASR client 130 and, at step 204, ASR server loads the transmitted
client grammar. As used herein, "loading" of the client grammar
means to have the grammar accessible for use by ASR server 100, e.g.
by storing the grammar in RAM of server node 110. At step 205, ASR
server 100 returns a grammar "handle" to ASR client 130. A grammar
"handle" is a marker, such as, e.g., a pointer to memory containing
the loaded grammar, that enables ASR client to easily refer to the
grammar during the rest of the communications session or application
execution. ASR client 130 receives the grammar handle from ASR
server 100 at step 206 and returns the handle to the application at
step 207. For the pizza ordering example, ASR server 100 would
receive and load the transmitted PIZZA grammar file and transmit
back to ASR client 130 a handle pointing to the loaded PIZZA
grammar. ASR client, in turn, would receive the PIZZA handle from
ASR server 100 and return the PIZZA handle to the pizza ordering
application. In this way, the application can simply refer to the
PIZZA handle when carrying out or initiating an ASR task as part of
the pizza ordering application.
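
The load-and-return-a-handle exchange of steps 203 to 205 can be
sketched as a small server-side table. In the Python sketch below the
handle is a plain integer rather than a memory pointer, and the class
name GrammarStore and the grammar text are assumptions made for
illustration.

```python
import itertools

class GrammarStore:
    """Server-side table of loaded grammars, keyed by an opaque handle."""

    def __init__(self):
        self._loaded = {}
        self._next_id = itertools.count(1)

    def load(self, grammar_text):
        """Keep the received grammar in memory and return a handle to it."""
        handle = next(self._next_id)
        self._loaded[handle] = grammar_text
        return handle

    def lookup(self, handle):
        """Resolve a handle sent back by the client in a later request."""
        return self._loaded[handle]

store = GrammarStore()
# The grammar text is a made-up placeholder, not a real grammar format.
pizza_handle = store.load("SIZE = small | medium | large")
```

The client only ever sees pizza_handle; it never needs to resend the
grammar itself for the rest of the session.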

An alternative setup approach will now be described with reference
to FIG. 3. It is assumed for the remainder of the description herein
that transmission or communication of information or data between
ASR server 100 and ASR client 130 take place over an established
TCP/IP socket. At step 301, ASR client 130 receives a request from
the application to load a client grammar. Rather than send the
client grammar as a data file to ASR server 100 at step 302,
however, ASR client 130 instead sends to ASR server 100 an
identifier representing a "canned" grammar; a "canned" grammar
would, e.g., be a common grammar, such as TIME-OF-DAY or DATE, which
ASR server 100 would already have stored. Alternatively, ASR client
130 could send to ASR server 100 an IP address, such as a
URL-compatible address, where ASR server 100 could find the desired
grammar file. ASR server 100 at step 303 receives the grammar
identifier or URL grammar address from ASR client 130, locates and
loads the requested client grammar at step 304, and at step 305
returns a grammar handle to ASR client 130. Similar to the steps
described above with respect to FIG. 2, ASR client 130 receives the
grammar handle from ASR server 100 at step 306 and returns the
handle to the application at step 307. For the pizza ordering
example, the steps described above in connection with FIG. 2 would
be the same, except that ASR client 130 would send to ASR server 100
a grammar identifier for the PIZZA grammar (if it were a "canned"
grammar) or a URL address for the location of a file containing the
PIZZA grammar; ASR server 100 would, in turn, retrieve a file for
the PIZZA grammar based upon the grammar identifier or URL address
(as sent by the ASR client) and then load the requested PIZZA
grammar.
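
The "locate" part of step 304 amounts to deciding whether the client
sent a canned-grammar identifier or an address. The Python sketch
below shows one possible way to make that decision; the table
contents, the function names and the restriction to http/https
addresses are all assumptions, and the actual retrieval is left as a
placeholder.

```python
from urllib.parse import urlparse

# Grammars the server is assumed to have stored already ("canned").
CANNED_GRAMMARS = {
    "TIME-OF-DAY": "<time-of-day grammar>",
    "DATE": "<date grammar>",
}

def fetch_url(url):
    """Placeholder for retrieving a grammar file from a remote address."""
    raise NotImplementedError(url)

def locate_grammar(reference, fetch=fetch_url):
    """Resolve a canned-grammar identifier or a URL-compatible address."""
    if reference in CANNED_GRAMMARS:
        return CANNED_GRAMMARS[reference]
    if urlparse(reference).scheme in ("http", "https"):
        return fetch(reference)
    raise KeyError(f"unknown grammar reference: {reference!r}")
```

Passing the retrieval function in as a parameter keeps the lookup
logic separate from network access, so either path ends the same way:
with grammar content ready to be loaded and a handle returned.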

After the grammar has been loaded and a grammar handle returned to
ASR client 130, an ASR service application needs to select a grammar
rule to be activated. FIG. 4 shows a process for grammar rule
selection in accordance with the present invention. ASR client 130
receives from the application a request to activate a grammar rule
at step 401. At step 402, ASR client sends a rule activate request
to ASR server 100; as shown in FIG. 4, ASR client 130 may also at
step 402 send to ASR server 100 the previously-returned grammar
handle (which may enable ASR server to activate the appropriate
grammar rule for the particular grammar as identified by the grammar
handle). ASR server 100 at step 403 receives the rule activate
request and grammar handle (if sent). At step 404, ASR server 100
activates the requested rule and, at step 405, returns to ASR client
130 notification that the requested rule has been activated. ASR
client 130 receives at step 406 the notification of rule activation
and notifies the application at step 407 that the rule has been
activated. Once the application receives notice of rule activation,
it may then initiate recognition of speech.

For purposes of illustrating the process shown in FIG. 4, again
consider the pizza ordering example. A rule that may be used for
recognizing a pizza order may set the desired phrase for an order to
include the subgrammars SIZE and TOPPINGS along with the word
"pizza," and might be denoted in the following manner:
{ORDER = SIZE "pizza" "with" TOPPINGS}. With reference again to FIG.
4, ASR client 130 would receive from the application a request to
activate a pizza ordering rule and send the ORDER rule set out above
to ASR server 100 along with the PIZZA grammar handle. ASR server
receives the rule activate request along with the PIZZA grammar
handle and activates the ORDER rule, such that the recognizer would
be constrained to recognizing words from the SIZE subgrammar, the
word "pizza," the word "with" and words from the subgrammar
TOPPINGS. After activating the ORDER rule, ASR server 100 sends
notification of the rule activation to ASR client 130 which, in
turn, notifies the application.
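
To make the constraint imposed by the ORDER rule concrete, the Python
sketch below checks whether a sequence of words fits the pattern
SIZE "pizza" "with" TOPPINGS. It works on text rather than audio, and
it assumes that a subgrammar slot accepts one or more consecutive
words from that subgrammar; the text does not define this, so treat
it as an illustrative choice. The names SUBGRAMMARS, ORDER_RULE and
matches are hypothetical.

```python
SUBGRAMMARS = {
    "SIZE": {"small", "medium", "large"},
    "TOPPINGS": {"sausage", "pepperoni", "mushroom"},
}

# ORDER = SIZE "pizza" "with" TOPPINGS; upper-case names refer to
# subgrammars, lower-case entries are literal words.
ORDER_RULE = ["SIZE", "pizza", "with", "TOPPINGS"]

def matches(rule, words):
    """Return True if the word sequence fits the rule, slot by slot."""
    pos = 0
    for slot in rule:
        if slot in SUBGRAMMARS:
            # Assumed: consume one or more words from the subgrammar.
            start = pos
            while pos < len(words) and words[pos] in SUBGRAMMARS[slot]:
                pos += 1
            if pos == start:
                return False
        else:
            # A literal slot must match exactly one word.
            if pos >= len(words) or words[pos] != slot:
                return False
            pos += 1
    return pos == len(words)
```

For instance, the words of "large pizza with sausage pepperoni" fit
the rule, while "pizza with sausage" does not, because the SIZE slot
is left unfilled.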

Once a grammar rule has been activated, the processing of speech for
purposes of recognizing words in the grammar according to the rule
can take place. Referring to FIG. 5, at step 501 ASR client 130
receives a request from the application to initiate a speech
recognition task. At step 502, ASR client 130 requests streaming
audio from the audio input of client PC 140. Streaming audio refers
to audio being processed "on the fly" as more audio comes in; the
system does not wait for all of the audio input (i.e., the entire
speech) before sending the audio along for digital processing;
streaming audio may also refer to partial transmission of part of
the audio signal as additional audio is input. Illustratively, a
request for streaming audio may be accomplished by making an
appropriate software call to the operating system running on client
PC 140 such that streaming audio from the microphone input is
digitized by the sound processor of client PC 140. Streaming audio
digitized from the microphone input is then passed along to ASR
client 130. ASR client 130 then initiates transmission of streaming
digitized audio to ASR server 100 at step 503; like the audio input
from the microphone, the digitized audio is sent to ASR server 100
"on the fly" even while speech input continues.
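
The essence of streaming is that each piece of audio is forwarded as
soon as it is available. The Python sketch below illustrates this
with a file-like object standing in for the microphone and a plain
list standing in for the network; the chunk size of 320 bytes and the
function names are assumptions chosen for the example.

```python
import io

def stream_chunks(audio_source, chunk_bytes=320):
    """Yield fixed-size chunks as they become available from the source."""
    while True:
        chunk = audio_source.read(chunk_bytes)
        if not chunk:
            break
        yield chunk

def send_stream(audio_source, send):
    """Forward each chunk immediately instead of buffering the whole utterance."""
    sent = 0
    for chunk in stream_chunks(audio_source):
        send(chunk)
        sent += len(chunk)
    return sent

fake_microphone = io.BytesIO(bytes(1000))  # 1000 zero bytes as stand-in audio
outbox = []                                # stands in for the TCP/IP socket
total = send_stream(fake_microphone, outbox.append)
```

Here the 1000 bytes leave as three full chunks and one short final
chunk, and the first chunk is handed to send before the last one has
even been read.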



At step 504, ASR server 100 performs speech recognition on the
streaming digitized audio as the audio is received from ASR client
130. Speech recognition is performed using known recognition
algorithms, such as those employed by the AT&T WATSON speech
recognition engine, and is performed within the constraints of the
selected grammar as defined by the activated rule. At step 505, ASR
server 100 returns streaming text (i.e., partially recognized
speech) as the input speech is recognized. Thus, as ASR server 100
reaches its initial results, it returns those results to ASR client
130 even as ASR server 100 continues to process additional streaming
audio being sent by ASR client 130. This process of returning
recognized text "on the fly" permits ASR client 130 (or the
application interfacing with ASR client 130) to provide feedback to
the speaker. As ASR server 100 continues to process additional
streaming input audio, it may correct the results of the earlier
speech recognition, such that the returned text may actually update
(or correct) parts of the text already returned to ASR client 130 as
part of the speech recognition task. Once all of the streaming audio
has been received from ASR client 130, ASR server completes its
speech recognition processing and returns a final version of
recognized text (including corrections) at step 506.

At step 507, ASR client 130 receives the recognized text from ASR
server 100 and returns the text to the application at step 508.
Again, this may be done "on the fly" as the recognized text comes
in, and ASR client passes along to the application any corrections
to recognized text received from ASR server 100.
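
The text does not say how corrections are encoded. As one possible
reading, the Python sketch below assumes that each message from the
server carries a word together with its position in the utterance, so
that a later message for the same position overwrites an earlier
hypothesis. The class name Transcript and the (position, word)
message shape are hypothetical.

```python
class Transcript:
    """Client-side view of streaming text that the server may later revise."""

    def __init__(self):
        self.words = []

    def apply(self, position, word):
        """Place a word at a position, overwriting any earlier hypothesis there."""
        if position == len(self.words):
            self.words.append(word)
        elif position < len(self.words):
            self.words[position] = word
        else:
            raise ValueError("update skips ahead of the text received so far")

    def text(self):
        return " ".join(self.words)

transcript = Transcript()
# The second message for position 1 corrects an earlier misrecognition.
for position, word in [(0, "large"), (1, "pita"), (1, "pizza"), (2, "with")]:
    transcript.apply(position, word)
```

After these four messages the client holds "large pizza with", and
the application could have displayed each intermediate state to the
speaker as feedback.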

Referring to the pizza ordering example, once the ORDER rule has
been activated and the application notified, ASR client 130 will
receive a request to initiate speech recognition and will initiate
streaming audio from the microphone input. The speaker may be
prompted to speak the pizza order, and once speaking begins, ASR
client 130 sends digitized streaming audio to ASR server 100. Thus,
as the speaker states, e.g., that she wants to order a "large pizza
with sausage and pepperoni," ASR client 130 will have sent digitized
streaming audio for the first word of the order along to ASR server
100 even as the second word is being spoken. ASR server 100 will
return the first word as text, "large", as the rest of the order is
being spoken. Ultimately, once the speaker stops speaking, the final
recognized text for the order, "large pizza with sausage, pepperoni"
can be returned to ASR client 130 and, hence, to the application.

An alternative embodiment for carrying out the speech recognition
process in accordance with the present invention is shown in FIG. 6.
Similar to the speech recognition process shown in FIG. 5, at step
601 ASR client 130 receives a request from the application to
initiate a speech recognition task and, at step 602, ASR client 130
requests streaming audio from the audio input of client PC 140.
Streaming audio digitized from the microphone input is then passed
along to ASR client 130. At step 603, ASR client 130 compresses the
digitized audio "on the fly" and then initiates transmission of
streaming compressed digitized audio to ASR server 100, while speech
input continues.

At step 604, ASR server 100 decompresses the compressed audio
received from ASR client 130 before performing speech recognition on
the streaming digitized audio. As described above with reference to
FIG. 5, speech recognition is performed within the constraints of
the selected grammar as defined by the activated rule. At step 605,
ASR server 100 returns streaming text (i.e., partially recognized
speech) as the input speech is recognized. Thus, ASR server 100
returns initial results to ASR client 130 even as ASR server 100
continues to process additional compressed streaming audio being
sent by ASR client 130, and may update or correct parts of the text
already returned to ASR client 130 as part of the speech recognition
task. Once all of the streaming audio has been received from ASR
client 130, ASR server completes its speech recognition processing
and returns a final version of recognized text (including
corrections) at step 606. ASR client 130 receives the recognized
text from ASR server 100 at step 607 as it comes in and returns the
text to the application at step 608.
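
The compress-then-decompress arrangement of steps 603 and 604 can be
sketched with Python's zlib module. Note that zlib is a
general-purpose lossless compressor used here only as a stand-in; the
text does not name a compression method, and a codec designed for
speech would be a more typical choice. The function names and the
synthetic audio are assumptions.

```python
import zlib

def compress_stream(chunks):
    """Client side: compress chunks incrementally as they arrive."""
    compressor = zlib.compressobj()
    for chunk in chunks:
        # Z_SYNC_FLUSH emits everything buffered so far, so the server can
        # decode this chunk without waiting for the end of the utterance.
        data = compressor.compress(chunk) + compressor.flush(zlib.Z_SYNC_FLUSH)
        if data:
            yield data
    tail = compressor.flush()  # end-of-stream marker
    if tail:
        yield tail

def decompress_stream(packets):
    """Server side: recover the original audio bytes packet by packet."""
    decompressor = zlib.decompressobj()
    for packet in packets:
        data = decompressor.decompress(packet)
        if data:
            yield data

# Highly repetitive synthetic "audio", so the size reduction is easy to see.
audio_chunks = [bytes([i % 7]) * 320 for i in range(5)]
packets = list(compress_stream(audio_chunks))
restored = b"".join(decompress_stream(packets))
```

The sync flush after every chunk trades a little compression
efficiency for low latency, which matches the "on the fly" behavior
described above.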

Another alternative embodiment for carrying out the speech
recognition process in accordance with the present invention is
shown in FIG. 7. Similar to the speech recognition process shown in
FIGS. 5 and 6, at step 701 ASR client 130 receives a request from
the application to initiate a speech recognition task and, at step
702, ASR client 130 requests streaming audio from the audio input of
client PC 140. Streaming audio digitized from the microphone input
is then passed along to ASR client 130. At step 703, ASR client 130
processes the digitized audio "on the fly" to extract features
useful for speech recognition processing and then initiates
transmission of extracted features to ASR server 100, while speech
input continues. Extraction of relevant features from speech
involves grammar-independent processing that is typically part of
algorithms employed for speech recognition, and may be done using
methods known to those skilled in the art, such as those based upon
linear predictive coding (LPC) or Mel filter bank processing.
Feature extraction provides information obtained from
characteristics of voice signals while eliminating unnecessary
information, such as volume.
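
To show the general shape of client-side feature extraction without
reproducing LPC or Mel filter bank processing, the Python sketch
below reduces each frame of samples to two simple numbers: log energy
and a zero-crossing count. These two features are a deliberately
simplified substitute chosen for illustration, and the frame length
of 160 samples and the 8 kHz sampling rate are assumptions.

```python
import math

def frame_features(samples, frame_len=160):
    """Reduce each full frame of samples to (log energy, zero-crossing count)."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        log_energy = math.log(energy + 1e-10)  # offset avoids log(0) on silence
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        features.append((log_energy, crossings))
    return features

# 320 samples of a 1 kHz tone at an assumed 8 kHz sampling rate.
tone = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(320)]
tone_features = frame_features(tone)
```

Each 160-sample frame is replaced by just two values, which gives a
feel for why sending features can require far less data than sending
the digitized audio itself.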

Upon receiving extracted features from ASR client 130, ASR server
100 at step 704 performs speech recognition on the incoming features
which are arriving "on the fly" (i.e., in a manner analogous to
streaming audio). Speech recognition is performed within the
constraints of the selected grammar as defined by the activated
rule. As is the case with the embodiments discussed above with
reference to FIGS. 5 and 6, at step 705 ASR server 100 returns
streaming text (i.e., partially recognized speech) to ASR client 130
as the input features are recognized. ASR server 100 continues to
process additional extracted features being sent by ASR client 130,
and may update or correct parts of the text already returned to ASR
client 130. ASR server completes its speech recognition processing
upon receipt of all of the extracted features from ASR client 130,
and returns a final version of recognized text (including
corrections) at step 706. ASR client 130 receives the recognized
text from ASR server 100 at step 707 as it comes in and returns the
text to the application at step 708.

The alternative embodiments described above with respect to FIGS. 6
and 7 each provide for additional processing at the client end. For
the embodiment in FIG. 6, this entails compression of the streaming
audio (with audio decompression at the server end); for the
embodiment in FIG. 7, this includes part of the speech recognition
processing in the form of feature extraction. Using such additional
processing at the client end significantly reduces the amount of
data transmitted from ASR client 130 to ASR server 100. Thus, less
data is required to represent the speech signals being transmitted.
Where feature extraction is accomplished at the client end, such
benefits are potentially sharply increased, because extracted
features (as opposed to digitized voice signals) require less data
and no features need be sent during periods of silence. The
reduction of data produces a dual desired benefit: (1) it permits a
reduction in bandwidth required to achieve a certain level of
performance, and (2) it reduces the transmission time in sending
speech data from ASR client to ASR server through the TCP/IP socket.

While typically a grammar rule will be activated prior to the
initiation of transmission of speech information from ASR client 130
to ASR server 100, rule activation could take place after some or
all of the speech information to be recognized has been sent from
ASR client 130 to ASR server 100. In such a circumstance, ASR server
100 would not begin speech recognition efforts until a grammar rule
has been activated. Speech sent by ASR client 130 prior to
activation of a grammar rule could be stored temporarily by ASR
server 100 to be processed by the recognizer or, alternatively, such
speech could be ignored.
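
The two options for speech that arrives before rule activation,
storing it temporarily or ignoring it, can be sketched as a small
piece of server-side state. The Python class below is a hypothetical
illustration; the class name PendingSpeech and the keep_early_speech
flag are assumptions, and a list stands in for the recognizer's
input.

```python
class PendingSpeech:
    """Holds speech that arrives before a grammar rule is active."""

    def __init__(self, keep_early_speech=True):
        self.keep_early_speech = keep_early_speech
        self.active_rule = None
        self.buffered = []
        self.recognizer_input = []

    def on_speech(self, chunk):
        if self.active_rule is not None:
            self.recognizer_input.append(chunk)
        elif self.keep_early_speech:
            self.buffered.append(chunk)  # store temporarily
        # otherwise the early speech is ignored

    def on_rule_activated(self, rule_name):
        self.active_rule = rule_name
        # Release anything stored while no rule was active.
        self.recognizer_input.extend(self.buffered)
        self.buffered.clear()
```

With keep_early_speech=True, chunks received before activation reach
the recognizer once the rule is active; with False, only speech sent
after activation is recognized.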

Further, multiple speech recognition tasks may be executed using the
techniques of the present invention. For example, an ASR application
could request ASR client 130 to instruct ASR server 100 to load a
canned grammar for a telephone number (i.e., "PHONE NUMBER") and
then request activation of a rule covering spoken numbers. After a
phone number is spoken and recognized in accordance with the present
invention (e.g., in response to a prompt to speak the phone number,
ASR client 130 sends digitized spoken numbers to ASR server 100 for
recognition), the ASR application could then request ASR client 130
to set up and initiate recognition of pizza ordering speech (e.g.,
load PIZZA grammar, activate ORDER rule, and initiate speech
recognition) in accordance with the examples described above with
reference to FIGS. 2-5.

In addition to the simple pizza ordering example used above
for illustration, a wide array of potential ASR services may
be provided over a packet network in accordance with the
present invention. One example of an ASR application enabled
by the present invention is a form-filling service for
completing a form in response to spoken responses to
information requested for each of a number of blanks in the
form. In accordance with the present invention, a form-filling
service may be implemented wherein ASR client 130 sends
grammars representing the possible choices for each of the
blanks to ASR server 100. For each blank, ASR client 130
requests activation of the appropriate grammar rule and sends
a corresponding spoken answer made in response to a request
for information needed to complete the blank. ASR server 100
applies an appropriate speech recognition algorithm in
accordance with the selected grammar and rule, and returns
text to be inserted in the form.
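
The form-filling loop may be sketched as follows. This is an illustration only: the function names are hypothetical, each grammar is reduced to a plain list of choices, and `toy_recognize` merely treats the audio bytes as text so that the example runs without a real recognizer.

```python
from typing import Callable, Dict, List


def toy_recognize(choices: List[str], audio: bytes) -> str:
    """Stand-in recognizer: returns the choice matching the 'audio', or ""."""
    heard = audio.decode("utf-8").strip().lower()
    for choice in choices:
        if choice.lower() == heard:
            return choice
    return ""


def fill_form(blanks: Dict[str, List[str]],
              capture: Callable[[str], bytes],
              recognize: Callable[[List[str], bytes], str] = toy_recognize
              ) -> Dict[str, str]:
    """For each blank, capture a spoken answer and insert the returned text."""
    form: Dict[str, str] = {}
    for blank, choices in blanks.items():
        audio = capture(blank)                   # answer to the prompt for this blank
        form[blank] = recognize(choices, audio)  # result is constrained to the choices
    return form
```

Because the result for each blank is drawn from that blank's own choices, an answer outside the grammar yields an empty entry rather than arbitrary text.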

Other ASR services may involve an exchange of information
(e.g., a dialogue) between server and client. For example, an
ASR service application for handling flight reservations may,
in accordance with the present invention as described herein,
utilize a dialogue between ASR server 100 and ASR client 130
to accomplish the ASR task. A dialogue may proceed as follows:

Speaker (through ASR client 130 to ASR server 100):

"I want a flight to Los Angeles."

ASR server's response to ASR client (in the form of text or,
alternatively, speech returned by ASR server 100 to ASR
client 130):

"From what city will you be leaving?"

Speaker (through ASR client to ASR server):

"Washington, DC."

ASR server's response to ASR client:

"What day do you want to leave?"

Speaker (ASR client to ASR server):

"Tuesday."

ASR server's response to ASR client:

"What time do you want to leave?"

Speaker (ASR client to ASR server):

"4 o'clock in the afternoon."

ASR server's response to ASR client:

"I can book you on XYZ Airline flight 4567 from Washington,
DC to Los Angeles on Tuesday at 4 o'clock PM. Do you want to
reserve a seat on this flight?"

In this case, the information received from ASR server 100 is
not literally the text from the recognized speech, but is
information based upon the recognized speech (which would
depend upon the application). Each leg of the dialogue may be
accomplished in accordance with the ASR client-server method
described above. As may be observed from this example, such an
ASR service application requires of the ASR client and ASR
server not only the ability to handle natural language, but
also access to a large database that is constantly changing.
To accomplish this, it may be desirable to have the ASR
service application actually installed and executing in server
node 110, rather than in client PC 140. Client PC 140 would,
in that case, merely have to run a relatively small "agent"
program that, at the control of the application program
running at server node 110, initiates ASR client 130 and
shepherds the speech input through ASR client 130 along to ASR
server 100. An example of such an "agent" program may be,
e.g., one that places a "talking head" on the screen of client
PC 140 to assist the interaction between an individual using
the ASR service application at client PC 140 and, through ASR
client 130 and ASR server 100, sends the person's speech
information along to ASR server 100 for recognition.
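
How the returned information can be based upon, rather than identical to, the recognized speech may be sketched as a small slot-filling routine running with the application at the server node. The slot names, the slot order, and the first prompt are assumptions; the remaining prompts follow the sample dialogue, and `find_flight` is a hypothetical stand-in for the constantly changing flight database.

```python
from typing import Callable, Dict

# Prompts in the order the slots are requested. The "destination" prompt is
# hypothetical; the others are taken from the sample dialogue.
PROMPTS = {
    "destination": "Where would you like to fly?",
    "origin": "From what city will you be leaving?",
    "day": "What day do you want to leave?",
    "time": "What time do you want to leave?",
}


def next_response(slots: Dict[str, str],
                  find_flight: Callable[[Dict[str, str]], str]) -> str:
    """Return the server's next utterance given the slots recognized so far."""
    for slot, prompt in PROMPTS.items():
        if slot not in slots:
            return prompt            # ask for the first missing item
    flight = find_flight(slots)      # database lookup, assumed to succeed
    return (f"I can book you on {flight} from {slots['origin']} to "
            f"{slots['destination']} on {slots['day']} at {slots['time']}. "
            "Do you want to reserve a seat on this flight?")
```

Each call corresponds to one leg of the dialogue: the recognized speech updates `slots`, and the text returned to the client is whatever the application needs to say next.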

In summary, the present invention provides a way of providing
ASR services that may be made available to users over a packet
network, such as the Internet, at a location remote from a
system hosting an ASR engine using a client-server
architecture.

What has been described is merely illustrative of the
application of the principles of the present invention. Other
arrangements and methods can be implemented by those skilled
in the art without departing from the spirit and scope of the
present invention.

Where technical features mentioned in any claim are followed
by reference signs, those reference signs have been included
for the sole purpose of increasing the intelligibility of the
claims and accordingly, such reference signs do not have any
limiting effect on the scope of each element identified by way
of example by such reference signs.

Claims 

1. A method of operating an automatic speech recognition
service accessible by a client over a packet network,
comprising the steps of:

a. receiving from the client over the packet network
information corresponding to a grammar used for speech
recognition;

b. receiving from the client over the packet network
information representing speech;

c. recognizing the received speech information by applying
an automatic speech recognition algorithm in accordance with
the grammar; and

d. sending information based upon the recognized speech over
the packet network to the client.

2. The invention according to claims 1 or 28, further
comprising the step of, if the information corresponding to a
grammar is an address corresponding to the location of a
grammar, obtaining access to a grammar located at the
corresponding grammar address.

3. The invention according to claims 2, 13 or 21, wherein the
address corresponding to the location of a grammar is a
uniform resource locator-compatible address.
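
The distinction drawn in claims 2 and 3 (grammar information that is itself a grammar versus an address locating one) can be sketched as below. The function name, the accepted URL schemes, and the `fetch` parameter are assumptions for illustration; only standard-library calls (`urllib.parse.urlparse`, `urllib.request.urlopen`) are used.

```python
from typing import Callable
from urllib.parse import urlparse
from urllib.request import urlopen

ADDRESS_SCHEMES = ("http", "https", "file")  # assumed set of accepted schemes


def _default_fetch(url: str) -> str:
    """Retrieve the grammar text stored at a URL-compatible address."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def resolve_grammar(info: str,
                    fetch: Callable[[str], str] = _default_fetch) -> str:
    """Return grammar text, fetching it first if `info` is an address."""
    candidate = info.strip()
    parsed = urlparse(candidate)
    if parsed.scheme in ADDRESS_SCHEMES and (parsed.netloc or parsed.path):
        return fetch(candidate)  # the information is an address
    return info                  # the information is the grammar itself
```

Passing `fetch` as a parameter keeps the retrieval mechanism replaceable, for example by a cache of canned grammars held at the server.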

4. The invention according to claims 1, 12, 20 or 28,
wherein the information representing speech arrives from the
client in a streaming manner; or

wherein the information representing speech received from
the client comprises digitized speech; or

wherein the information representing speech received from
the client comprises compressed digitized speech; or

wherein the information representing speech received from
the client comprises features extracted by the client from
digitized speech.

5. The invention according to claim 1, wherein the step of
recognizing the received speech information is repeated as
new speech information is received from the client.

6. The invention according to claims 1, 9, 12, 17, 20 or 25,
wherein the information based upon the recognized speech
comprises text information; or

wherein the information based upon the recognized speech
comprises additional speech.

7. The invention according to claim 1, wherein the step of
sending information based upon the recognized speech is
repeated as additional speech information is recognized.

8. The invention according to claim 7, further comprising
the step of sending to the client revised information based
upon recognized speech previously sent to the client.
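
Claims 7 and 8 can be illustrated by a sketch in which results are sent repeatedly as speech accumulates and a later result may revise one already sent. The generator below is an assumption-laden illustration: `recognize_so_far` is a hypothetical recognizer returning its best hypothesis for all audio received so far, and the labels "update" and "revised" are invented for the example.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def stream_results(chunks: Iterable[bytes],
                   recognize_so_far: Callable[[bytes], str]
                   ) -> Iterator[Tuple[str, str]]:
    """Yield (kind, text) messages for the client as speech arrives."""
    sent: Optional[str] = None
    audio = b""
    for chunk in chunks:
        audio += chunk
        hypothesis = recognize_so_far(audio)
        if hypothesis == sent:
            continue  # nothing new to send
        # A hypothesis that does not extend the previous one revises it.
        revised = sent is not None and not hypothesis.startswith(sent)
        sent = hypothesis
        yield ("revised" if revised else "update", hypothesis)
```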



9. The invention according to claim 1, wherein the steps of
b, c and d are repeated to create an exchange of information
between client and server.

10. The invention according to claims 1 or 28, further
comprising the step of activating a grammar rule in response
to a request received from the client over the packet
network.

11. The invention according to claims 1 or 28, further
comprising the step of sending over the packet network to the
client a handle corresponding to the grammar.

12. A system for operating an automatic speech recognition
service accessible by a client over a packet network,
comprising:

a. a programmable processor;

b. memory;

c. an audio input device; and

d. a communications interface for establishing a
communications link with the client over the packet network;

wherein said processor is programmed to execute the steps
of:

i. receiving from the client over the packet network
information corresponding to a grammar used for speech
recognition;

ii. receiving from the client over the packet network
information representing speech;

iii. recognizing the received speech information by applying
an automatic speech recognition algorithm in accordance with
the grammar; and

iv. sending information based upon the recognized speech
over the packet network to the client.

13. The invention according to claim 12, wherein the
processor is further programmed to execute the step of, if
the information corresponding to a grammar is an address
corresponding to the location of a grammar, obtaining access
to a grammar located at the corresponding grammar address.

14. The invention according to claim 12, wherein the
processor is further programmed to repeat the step of
recognizing the received speech information as new speech
information is received from the client.

15. The invention according to claim 12, wherein the
processor is further programmed to repeat the step of sending
information based upon the recognized speech as additional
speech information is recognized.



16. The invention according to claim 15, wherein the
processor is further programmed to execute the step of
sending to the client revised information based upon
recognized speech previously sent to the client.

17. The invention according to claim 12, wherein the
processor is further programmed to repeat the steps of b, c
and d to create an exchange of information between client and
server.

18. The invention according to claim 12, wherein the
processor is further programmed to execute the step of
activating a grammar rule in response to a request received
from the client over the packet network.

19. The invention according to claim 12, wherein the
processor is further programmed to execute the step of
sending over the packet network to the client a handle
corresponding to the grammar.
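
The handle and rule-activation steps recited in the claims may be sketched as a small server-side registry. Everything here is an assumption made for illustration: the class and method names, the use of integers as handles, and the representation of a grammar as a mapping from rule names to phrases.

```python
import itertools
from typing import Dict, List

Grammar = Dict[str, List[str]]  # rule name -> phrases (assumed representation)


class GrammarRegistry:
    """Hypothetical server-side store that issues handles for loaded grammars."""

    def __init__(self) -> None:
        self._next_handle = itertools.count(1)
        self._grammars: Dict[int, Grammar] = {}
        self.active_rules: Dict[int, str] = {}

    def load(self, grammar: Grammar) -> int:
        """Store a grammar and return the handle sent back to the client."""
        handle = next(self._next_handle)
        self._grammars[handle] = grammar
        return handle

    def activate(self, handle: int, rule: str) -> List[str]:
        """Activate a rule of a loaded grammar at the client's request."""
        grammar = self._grammars[handle]  # raises KeyError for an unknown handle
        if rule not in grammar:
            raise ValueError(f"no rule {rule!r} in grammar {handle}")
        self.active_rules[handle] = rule
        return grammar[rule]
```

The client then refers to the grammar by its handle in later requests instead of resending the grammar itself.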

20. An article of manufacture, comprising a
computer-readable medium having stored thereon instructions
for operating an automatic speech recognition service
accessible by a client over a packet network, said
instructions which, when performed by a processor, cause the
processor to execute a series of steps comprising:

a. receiving from the client over the packet network
information corresponding to a grammar used for speech
recognition;

b. receiving from the client over the packet network
information representing speech;

c. recognizing the received speech information by applying
an automatic speech recognition algorithm in accordance with
the grammar; and

d. sending information based upon the recognized speech over
the packet network to the client.

21. The invention according to claim 20, wherein the
instructions, when performed by a processor, further cause
the processor to execute the step of, if the information
corresponding to a grammar is an address corresponding to the
location of a grammar, obtaining access to a grammar located
at the corresponding grammar address.

22. The invention according to claim 20, wherein the
instructions, when performed by a processor, further cause
the processor to repeat the step of recognizing the received
speech information as new speech information is received from
the client.

23. The invention according to claim 20, wherein the
instructions, when performed by a processor, further cause
the processor to repeat the step of sending information based
upon the recognized speech as additional speech information
is recognized.

24. The invention according to claim 23, wherein the
instructions, when performed by a processor, further cause
the processor to execute the step of sending to the client
revised information based upon recognized speech previously
sent to the client.

25. The invention according to claim 20, wherein the
instructions, when performed by a processor, further cause
the processor to repeat the steps of b, c and d to create an
exchange of information between client and server.



26. The invention according to claim 20, wherein the
instructions, when performed by a processor, further cause
the processor to execute the step of activating a grammar
rule in response to a request received from the client over
the packet network.

27. The invention according to claim 20, wherein the
instructions, when performed by a processor, further cause
the processor to execute the step of sending over the packet
network to the client a handle corresponding to the grammar.

28. A method of operating an automatic form filling service
accessible by a client over a packet network, comprising the
steps of:

a. receiving from the client over the packet network
information corresponding to a grammar used for speech
recognition, wherein said grammar corresponds to words
associated with text information to be inserted in the form;

b. receiving from the client over the packet network
information representing speech;

c. recognizing the received speech information by applying
an automatic speech recognition algorithm in accordance with
the grammar; and

d. sending text corresponding to the recognized speech over
the packet network to the client for insertion in the form.
FIG. 4